English Speech


From A for algebra to T for tariffs: Arabic words used in English speech

Al Jazeera

Arabic is one of the world's most widely spoken languages, with at least 400 million speakers: roughly 200 million native speakers and 200 million to 250 million non-native speakers. Modern Standard Arabic (MSA) serves as the formal language of government, legal matters and education, and it is widely used in international and religious contexts. In addition, more than 25 dialects are spoken, primarily across the Middle East and North Africa. World Arabic Language Day is observed on December 18, the date in 1973 on which the UN General Assembly adopted Arabic as one of its six official languages. In the following visual explainer, Al Jazeera lists some of the most common words in today's English that originated from Arabic or passed through Arabic before reaching English.


Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data

Rashidi, Sina, Sameti, Hossein

arXiv.org Artificial Intelligence

Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves a 4.6 ASR BLEU improvement over direct baselines when trained with the synthetic data. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English.
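To make the three-stage design concrete, here is a minimal PyTorch sketch of the encoder-to-decoder-to-units flow. The layer sizes, the unit vocabulary of 1000, and the use of plain (rather than relative-position) attention are simplifying assumptions, and the vocoder stage is only noted in a comment; this is not the authors' released code.

```python
# Sketch of a direct S2ST pipeline: speech encoder -> causal unit decoder.
# Sizes and modules are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Stand-in for the conformer encoder (initialized from
    self-supervised pre-training in the paper)."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, feats):               # feats: (B, T, n_mels)
        return self.encoder(self.proj(feats))

class UnitDecoder(nn.Module):
    """Causal transformer decoder that emits discrete target speech units."""
    def __init__(self, n_units=1000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(n_units, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, n_units)

    def forward(self, units, memory):        # units: (B, U) previous units
        x = self.embed(units)
        # Causal mask so each position only attends to earlier units.
        L = x.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        return self.out(self.decoder(x, memory, tgt_mask=mask))

# Forward pass: source features -> encoder -> decoder -> unit logits.
# A unit-based neural vocoder (e.g. a unit HiFi-GAN) would then turn the
# predicted unit sequence into a waveform.
enc, dec = SpeechEncoder(), UnitDecoder()
feats = torch.randn(2, 120, 80)               # batch of source spectrograms
prev_units = torch.randint(0, 1000, (2, 50))  # teacher-forced unit history
logits = dec(prev_units, enc(feats))          # (2, 50, 1000)
```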


Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition

Wang, Jisung, Lee, Haram, Oh, Myungwoo

arXiv.org Artificial Intelligence

The limited availability of non-native speech datasets presents a major challenge for automatic speech recognition (ASR) in narrowing the performance gap between native and non-native speakers. To address this, this study focuses on efficiently incorporating L2 phonemes, which here refer to Korean phonemes, through articulatory feature analysis. This not only enables accurate modeling of pronunciation variants but also allows both native Korean and English speech datasets to be used. We employ the lattice-free maximum mutual information (LF-MMI) objective in an end-to-end manner to train the acoustic model to align with and predict one of multiple pronunciation candidates. Experimental results show that the proposed method improves ASR accuracy for Korean L2 speech while training solely on L1 speech data. Furthermore, fine-tuning on L2 speech improves recognition accuracy for both L1 and L2 speech without performance trade-offs.
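One way to picture the articulatory-feature step is to represent each phoneme as a small feature vector and rank candidate phonemes by feature distance, so an L2 phoneme maps to several plausible L1 pronunciations. The tiny feature inventory and Hamming-distance ranking below are made-up illustrations, not the paper's actual feature set or candidate-generation procedure.

```python
# Illustrative sketch: rank pronunciation candidates for an L2 phoneme by
# distance in articulatory-feature space. Features here are a toy example.
ARTIC_FEATURES = {
    # phoneme: (voiced, bilabial, alveolar, velar, plosive, nasal)
    "p": (0, 1, 0, 0, 1, 0),
    "b": (1, 1, 0, 0, 1, 0),
    "t": (0, 0, 1, 0, 1, 0),
    "d": (1, 0, 1, 0, 1, 0),
    "k": (0, 0, 0, 1, 1, 0),
    "g": (1, 0, 0, 1, 1, 0),
    "m": (1, 1, 0, 0, 0, 1),
    "n": (1, 0, 1, 0, 0, 1),
}

def candidates(l2_phoneme, l1_inventory, k=2):
    """Return the k L1 phonemes closest to the L2 phoneme by Hamming
    distance over articulatory features."""
    target = ARTIC_FEATURES[l2_phoneme]
    dist = lambda p: sum(a != b for a, b in zip(ARTIC_FEATURES[p], target))
    return sorted(l1_inventory, key=dist)[:k]

# The k nearest phonemes become alternative pronunciations; LF-MMI training
# then lets the acoustic model align against any of the candidates.
print(candidates("b", ["p", "t", "d", "k", "g", "m"]))  # -> ['p', 'd']
```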


M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

Berry, Layne, Shih, Yi-Jen, Wang, Hsuan-Fu, Chang, Heng-Jui, Lee, Hung-yi, Harwath, David

arXiv.org Artificial Intelligence

This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages. We identify key differences in model behavior and performance between English and non-English settings, attributable to the English-only pre-training of CLIP and HuBERT, and investigate how fine-tuning the pre-trained models affects these differences. Finally, we show that our models can be used for mono- and cross-lingual speech-text retrieval and cross-lingual speech-speech retrieval, despite never having seen parallel speech-text or speech-speech data during training.
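The retrieval step itself reduces to scoring spoken captions against images by cosine similarity in a shared embedding space. The sketch below shows that scoring with random tensors standing in for HuBERT speech features and CLIP image features; the projection heads and mean-pooling are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of speech-to-image retrieval via cosine similarity.
# Random tensors stand in for HuBERT (speech) and CLIP (image) features.
import torch
import torch.nn.functional as F

d_speech, d_image, d_joint = 768, 512, 256
speech_proj = torch.nn.Linear(d_speech, d_joint)  # learned projection heads
image_proj = torch.nn.Linear(d_image, d_joint)

speech_feats = torch.randn(4, 250, d_speech)      # (captions, frames, dim)
image_feats = torch.randn(10, d_image)            # (images, dim)

# Pool speech frames, project both sides into the joint space, normalize.
s = F.normalize(speech_proj(speech_feats.mean(dim=1)), dim=-1)
v = F.normalize(image_proj(image_feats), dim=-1)

scores = s @ v.T                                  # (4 captions, 10 images)
best = scores.argmax(dim=1)                       # retrieved image per caption
print(best)
```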


A model that can recognize speech in different languages from a speaker's lip movements

#artificialintelligence

In recent years, deep learning techniques have achieved remarkable results in numerous language and image-processing tasks. This includes visual speech recognition (VSR), which entails identifying the content of speech solely by analyzing a speaker's lip movements. While some deep learning algorithms have achieved highly promising results on VSR tasks, they were primarily trained to detect speech in English, as most existing training datasets only include English speech. This limits their potential user base to people who live or work in English-speaking contexts. Researchers at Imperial College London have recently developed a new model that can tackle VSR tasks in multiple languages.

